
Fix cluster restart failure after bootstrap with built-in log service #266

Draft

Copilot wants to merge 6 commits into main from copilot/fix-cluster-start-failure

Conversation


Copilot AI commented Dec 5, 2025

Here are some reminders before you submit the pull request

  • Add tests for the change
  • Document changes
  • Reference the RFC link if one exists
  • Pass ./mtr --suite=mono_main,mono_multi,mono_basic

Description

During bootstrap, the log service's raft groups undergo multiple leader elections. On restart, tx_service initializes with a stale leader cache, causing connection failures when it contacts the log groups.

Changes

Added proactive leader discovery after log_agent initialization in Sharder::Init():

if (log_agent_ != nullptr)
{
    log_agent_->Init(txlog_ips_, txlog_ports_, 0);
    
    // Refresh leader info for all log groups
    uint32_t log_group_count = log_agent_->LogGroupCount();
    if (log_group_count == 0)
    {
        LOG(WARNING) << "No log groups found after log agent initialization. "
                        "Log service may not be properly configured.";
    }
    else
    {
        LOG(INFO) << "Refreshing leader info for " << log_group_count
                  << " log group(s) after initialization";
        for (uint32_t lg_id = 0; lg_id < log_group_count; ++lg_id)
        {
            log_agent_->RefreshLeader(lg_id);
            DLOG(INFO) << "Refreshed leader for log group " << lg_id;
        }
    }
}

Key points:

  • RefreshLeader() queries current leader from each log group immediately after initialization
  • Prevents connection failures from stale leader cache on restart
  • Added validation for zero log groups to detect configuration issues
  • Operation is lightweight (cache update only); error handling delegated to log service layer

Impact

Resolves connection failures during cluster restart after bootstrap. Part of coordinated fix with log_service #27, eloq_log_service #33, eloqdoc #345, eloqkv #329, eloqsql #177.

Original prompt

This section details the original issue you should resolve

<issue_title>[Bug]: Fix cluster start failure after bootstrap with built-in log service</issue_title>
<issue_description>- Software version (the version of the software you're running):(required)
e.g. 0.0.1

  • Reproduce steps (describe steps of reproducing the bug):(required)

    • Bootstrap an eloqdoc cluster with a 3-node config and the built-in log service.
    • Restarting the eloqdoc cluster fails with a raft log sync error.
    • Leader changes in the log service raft group caused the issue.
  • Expected result (describe the right result as expected):(required)

  • Actual result (describe the incorrect result actually happened):(required)

  • Related issues (references to the related issues, e.g. eloqdb/repository#number):(optional)
    </issue_description>

Comments on the Issue (you are @copilot in this section)

@githubzilla

Log Service Raft Group Leader Change Analysis

Cluster Configuration

  • Log Group (lg0): Single log group with 3 nodes
  • Node 0: 127.0.0.1:16381
  • Node 1: 127.0.0.1:17381
  • Node 2: 127.0.0.1:18381
  • Group Configuration: 127.0.0.1:16381:0:0,127.0.0.1:17381:0:0,127.0.0.1:18381:0:0

Timeline of Leader Changes

Phase 1: Initial Startup (Single Node)

Time: 15:11:18.727 - Node 0 becomes leader (Term 3)

  • Node 0 starts as a single-node group
  • Becomes leader with term 3
  • Configuration: Only 127.0.0.1:16381:0:0
  • Log Entry (tx_log_0.log:39):
    node lg0:127.0.0.1:16381:0:0 term 3 become leader of group 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_0.log:53):
    Log node 127.0.0.1:16381 becomes the leader of the log group eloqdata/tx_service#0, term: 3, leader lease valid? yes
    

Phase 2: Cluster Expansion - Node 1 Joins

Time: 15:11:30.818 - Node 1 becomes leader (Term 4)

  • Node 0 was leader but receives PreVote/RequestVote from Node 1 with term 4
  • Node 0 steps down: "stops being the leader" - reason: "Raft node receives higher term request_vote_request"
  • Node 1 wins election and becomes leader
  • Log Entry (tx_log_0.log:247):
    Log node 127.0.0.1:16381 becomes the follower of the group eloqdata/tx_service#0, term: 4
    
  • Log Entry (tx_log_1.log:161):
    node lg0:127.0.0.1:17381:0:0 term 4 become leader of group 127.0.0.1:16381:0:0,127.0.0.1:17381:0:0,127.0.0.1:18381:0:0
    
  • Log Entry (tx_log_1.log:166):
    Log node 127.0.0.1:17381 becomes the leader of the log group eloqdata/tx_service#0, term: 4, leader lease valid? yes
    
  • Leader Refresh (tx_log_0.log:263):
    Refresh log group:0 leader from node_id: 0 to node_id: 1
    
  • Leader Refresh (tx_log_1.log:182):
    Refresh log group:0 leader from node_id: 0 to node_id: 1
    

Observation: Node 0 had log entries up to index 11 from term 2, but Node 1's term 4 entries start from index 1, causing log mismatch issues when Node 1 tries to replicate.

Phase 3: Leadership Transfer Request

Time: 15:11:35.837 - Leadership Transfer Initiated

  • Node 1 (current leader) starts transferring leadership to Node 0
  • Log Entry (tx_log_1.log:183):
    node lg0:127.0.0.1:17381:0:0 starts to transfer leadership to 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_1.log:184):
    Log node 127.0.0.1:17381 stops being the leader of the log group eloqdata/tx_service#0, status: Raft leader is transferring leadership to 127.0.0.1:16381:0:0
    
  • Node 0 attempts to become leader through election

Phase 4: Node 0 Wins Election

Time: 15:11:37.827 - Node 0 becomes leader (Term 5)

  • After leadership transfer, Node 0 wins election with term 5
  • Log Entry (tx_log_0.log:274):
    node lg0:127.0.0.1:16381:0:0 term 5 become leader of group 127.0.0.1:16381:0:0
    
  • Log Entry (tx_log_0.log:276):
    Log node 127.0.0.1:16381 becomes the leader of the log group eloqdata/tx_service#0, term: 5, leader lease valid? yes
    
  • Leader Refresh (tx_log_0.log:285):
    Update log group:0 leader to node_id:0
    

Issue: Node 0 still has log entries from term 2 (index 13), but Node 2 rejects AppendEntries because of term mismatch (Node 2 expects term 6 entries but Node 0 has term 2 entries).

Phase 5: Node 2 Wins Election

Time: 15:11:39.785 - Node 2 becomes leader (Term 6)

  • Node 2 initiates election with term 6
  • Node 0 receives PreVote/RequestVote from Node 2, steps down
  • Log Entry (tx_log_0.log:303):
    Log node 127.0.0.1:16381 stops being the leader of the log group eloqdata/tx_service#0, status: Raft node receives higher term request_vote_request.
    
  • Log Entry (tx_log_0.log:308):
    Log node 127.0.0.1:16381 becomes the follower of the group eloqdata/tx_service#0, term: 6
    
  • Log Entry (tx_log_1.log:198):
    Log node 127.0.0.1:17381 becomes the follower of the group eloqdata/tx_service#0, term: 6
    
  • Log Entry (tx_log_2.log:17...



This fix ensures that after initializing the log agent, we immediately
refresh the leader information for all log groups. This is critical after
cluster restart scenarios where log group leaders may have changed during
the previous session (e.g., during bootstrap).

The RefreshLeader call will query the current leader from each log group,
updating the leader cache with accurate information before any log
operations begin. This prevents connection failures that occur when
trying to communicate with stale/incorrect leader nodes.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


Copilot AI and others added 4 commits December 5, 2025 06:23
Added INFO and DEBUG level logging to track leader refresh operations
for better observability and debugging. This helps verify that leader
discovery is working correctly during cluster initialization and
restart scenarios.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Add warning when no log groups are found to detect configuration issues
- Add comment explaining RefreshLeader is lightweight and handles errors
- Improve code robustness and observability

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
- Fix multi-line string formatting for better readability
- log_group_count is already stored in a variable (no redundant calls)
- Improved code consistency

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
The _codeql_detected_source_root file is a temporary artifact from
security scanning and should not be committed to the repository.

Co-authored-by: githubzilla <10080336+githubzilla@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix cluster start failure after bootstrap with log service Fix cluster restart failure after bootstrap with built-in log service Dec 5, 2025
Copilot AI requested a review from githubzilla December 5, 2025 06:29


Successfully merging this pull request may close these issues.

[Bug]: Fix cluster start failure after bootstrap with built-in log service
